Version: v1.3.0

Introduction to Bluebook for Bulldozers Dataset

Overview of the Dataset

The Bluebook for Bulldozers dataset, sourced from Kaggle, is a comprehensive compilation of historical auction prices for used bulldozers. It includes detailed information about the equipment, such as model type, usage hours, and age, providing an in-depth view of the market dynamics for these heavy machines.

Importance of the Dataset

This dataset is particularly valuable for understanding price trends and market behavior in the heavy machinery sector. It serves as a rich source for predictive modeling, allowing analysts to forecast future prices and trends in the construction equipment market.

Explainable Machine Learning

Understanding Explainability in ML

Explainable Machine Learning (ML) refers to techniques and methods that make the output of ML models more understandable to humans. This is crucial in scenarios where decision-making needs to be transparent and justifiable.

Role of Explainability in Predictive Modeling

In predictive modeling, especially with complex models like those used in the Bluebook dataset, explainability helps in understanding how and why certain predictions are made. This transparency is vital for trust and validation of model predictions in real-world applications.

Potential Use Case: Analysing Arbitrage Opportunities

Defining Arbitrage in Equipment Sales

Arbitrage, in the context of equipment sales, refers to the opportunity to profit from price discrepancies in different markets or time periods. By accurately predicting the future prices of bulldozers, one can identify undervalued machines to purchase and overvalued machines to sell, capitalising on these market inefficiencies.

Utilising ML Predictions for Arbitrage

This subsection explores how the predictions made by the explainable ML model can be used to identify potential arbitrage opportunities in the bulldozer market. The explainable aspect of the model allows for a deeper understanding of the factors influencing price predictions, aiding in more strategic buying and selling decisions.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score, mean_squared_error, explained_variance_score, mean_squared_log_error
import matplotlib.pyplot as plt
import requests

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import os

import xplainable as xp
from xplainable.core.ml.regression import XRegressor
from xplainable.core.optimisation.genetic import XEvolutionaryNetwork
from xplainable.core.optimisation.layers import Evolve, Tighten

import xplainable_client
import json

print(f"This notebook was created using Xplainable version {xp.__version__}")

Out:

This notebook was created using Xplainable version 1.3.0

Error plot Function Overwrite

Note: This will be part of the python package as part of v1.1 patch

from sklearn.metrics import mean_absolute_error
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def plot_error(model, x, y, alpha=0.5, color_column=None):
    fig, ax = plt.subplots(figsize=(12, 8))

    y_pred = model.predict(x)
    mae = mean_absolute_error(y, y_pred)
    errors = abs(y - y_pred)

    if color_column is not None:
        if color_column not in x.columns:
            raise ValueError(f"The color_column {color_column} is not in the DataFrame.")
        
        # Convert column to categorical and get codes and unique values
        categories = x[color_column].astype('category').cat.categories
        codes = x[color_column].astype('category').cat.codes
        unique_codes = np.unique(codes)
        scatter = ax.scatter(y, y_pred, c=codes, alpha=alpha, cmap='plasma')

        # Create a legend with the actual category labels
        handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=scatter.cmap(scatter.norm(code)), 
                               markersize=10) for code in unique_codes]
        ax.legend(handles, categories, title=color_column)

    else:
        scatter = ax.scatter(y, y_pred, c=errors, alpha=alpha, cmap='plasma')
        plt.colorbar(scatter, ax=ax, label='Absolute Error')

    # Line for perfect predictions
    max_val = np.maximum(y.max(), y_pred.max())
    ax.plot([0, max_val], [0, max_val], 'k--', lw=2)

    # Labels and title
    ax.set_xlabel('True Values')
    ax.set_ylabel('Predicted Values')
    ax.set_title(f'Scatter Plot of True vs Predicted Values with MAE: {mae:.2f}')

    plt.show()

# Example usage:
# plot_error(model, X_train, y_train, alpha=0.5, color_column='ModelID')

Building the Dataset

It's possible to download the Bluebook dozer price prediction dataset at the following link: https://www.kaggle.com/c/bluebook-for-bulldozers/data

Following extraction of the .zip file build the dataset as below:

df = pd.read_csv('https://xplainable-public-storage.syd1.digitaloceanspaces.com/example_data/TrainAndValid.csv', parse_dates = ['saledate'])

df.head()

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	saledate	...	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type	Backhoe_Mounting	Blade_Type	Travel_Controls	Differential_Type	Steering_Controls
0	1139246	66000	999089	3157	121	3	2004	68	Low	2006-11-16	...	nan	nan	nan	nan	nan	nan	nan	nan	Standard	Conventional
1	1139248	57000	117657	77	121	3	1996	4640	Low	2004-03-26	...	nan	nan	nan	nan	nan	nan	nan	nan	Standard	Conventional
2	1139249	10000	434808	7009	121	3	2001	2838	High	2004-02-26	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
3	1139251	38500	1026470	332	121	3	2001	3486	High	2011-05-19	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
4	1139253	11000	1057373	17311	121	3	2007	722	Medium	2009-07-23	...	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan

Add the machine appendix to concatenate information about the dozer assets

ma = pd.read_csv('https://xplainable-public-storage.syd1.digitaloceanspaces.com/example_data/Machine_Appendix.csv')

ma.head()

	MachineID	ModelID	fiModelDesc	fiBaseModel	fiSecondaryDesc	fiModelSeries	fiModelDescriptor	fiProductClassDesc	ProductGroup	ProductGroupDesc	MfgYear	fiManufacturerID	fiManufacturerDesc	PrimarySizeBasis	PrimaryLower	PrimaryUpper
0	113	1355	350L	350	nan	nan	L	Hydraulic Excavator, Track - 50.0 to 66.0 Metr...	TEX	Track Excavators	1994	26	Caterpillar	Weight - Metric Tons	50	66
1	434	3538	416C	416	C	nan	nan	Backhoe Loader - 14.0 to 15.0 Ft Standard Digg...	BL	Backhoe Loaders	1997	26	Caterpillar	Standard Digging Depth - Ft	14	15
2	534	3538	416C	416	C	nan	nan	Backhoe Loader - 14.0 to 15.0 Ft Standard Digg...	BL	Backhoe Loaders	1998	26	Caterpillar	Standard Digging Depth - Ft	14	15
3	718	3538	416C	416	C	nan	nan	Backhoe Loader - 14.0 to 15.0 Ft Standard Digg...	BL	Backhoe Loaders	2000	26	Caterpillar	Standard Digging Depth - Ft	14	15
4	1753	1580	D5GLGP	D5	G	nan	LGP	Track Type Tractor, Dozer - 85.0 to 105.0 Hors...	TTT	Track Type Tractors	2006	26	Caterpillar	Horsepower	85	105

Merging the dataset on the MachineID to extract useful information:

Find the columns that exist within the machine dictionary that aren't in the training dataset
Merge the new columns on the existing train dataset to enrich the information

new_cols = [col for col in ma.columns if col not in df.columns]
ma[new_cols].head()

	MfgYear	fiManufacturerID	fiManufacturerDesc	PrimarySizeBasis	PrimaryLower	PrimaryUpper
0	1994	26	Caterpillar	Weight - Metric Tons	50	66
1	1997	26	Caterpillar	Standard Digging Depth - Ft	14	15
2	1998	26	Caterpillar	Standard Digging Depth - Ft	14	15
3	2000	26	Caterpillar	Standard Digging Depth - Ft	14	15
4	2006	26	Caterpillar	Horsepower	85	105

merge_col = "MachineID"
df = pd.merge(df, ma[[merge_col]+ new_cols ],on='MachineID', how='left')

Feature Engineering Overview

Note: The approach to feature engineering presented here is foundational and does not encompass the full depth typically seen in data science projects. The release of version 1.1 has improved our system's ability to handle missing values directly, which reduces the time needed to process and analyse data that contains null entries.

# Extract Purchase date information
df['saleyear'] = df['saledate'].dt.year
df['salemonth'] = df['saledate'].dt.month
df['saledayofweek'] = df['saledate'].dt.day_name()

# Drop the Sale Date following extraction of features
df.drop('saledate', inplace=True, axis=1)

#Rename columns
df.columns = [col.replace("_","") for col in df.columns]

#Filter our erroneous purchase values
df = df[df.YearMade > 1920]

#Turn model id into a categorical value so it doesn't create regression splits
df["ModelID"] = df["ModelID"].astype(str) 

df

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	fiModelDesc	...	SteeringControls	MfgYear	fiManufacturerID	fiManufacturerDesc	PrimarySizeBasis	PrimaryLower	PrimaryUpper	saleyear	salemonth	saledayofweek
0	1139246	66000.0	999089	3157	121	3.0	2004	68.0	Low	521D	...	Conventional	2004.0	25	Case	Horsepower	110.0	120.0	2006	11	Thursday
1	1139248	57000.0	117657	77	121	3.0	1996	4640.0	Low	950FII	...	Conventional	1996.0	26	Caterpillar	Horsepower	150.0	175.0	2004	3	Friday
2	1139249	10000.0	434808	7009	121	3.0	2001	2838.0	High	226	...	nan	2001.0	26	Caterpillar	Operating Capacity - Lbs	1351.0	1601.0	2004	2	Thursday
3	1139251	38500.0	1026470	332	121	3.0	2001	3486.0	High	PC120-6E	...	nan	2010.0	103	Komatsu	Horsepower	225.0	250.0	2011	5	Thursday
4	1139253	11000.0	1057373	17311	121	3.0	2007	722.0	Medium	S175	...	nan	2007.0	121	Bobcat	Operating Capacity - Lbs	1601.0	1751.0	2009	7	Thursday
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
412693	6333344	10000.0	1919201	21435	149	2.0	2005	nan	nan	30NX	...	nan	2005.0	2552	IHI	Weight - Metric Tons	2.0	3.0	2012	3	Wednesday
412694	6333345	10500.0	1882122	21436	149	2.0	2005	nan	nan	30NX2	...	nan	2005.0	2552	IHI	Weight - Metric Tons	3.0	4.0	2012	1	Saturday
412695	6333347	12500.0	1944213	21435	149	2.0	2005	nan	nan	30NX	...	nan	2005.0	2552	IHI	Weight - Metric Tons	2.0	3.0	2012	1	Saturday
412696	6333348	10000.0	1794518	21435	149	2.0	2006	nan	nan	30NX	...	nan	2006.0	2552	IHI	Weight - Metric Tons	2.0	3.0	2012	3	Wednesday
412697	6333349	13000.0	1944743	21436	149	2.0	2006	nan	nan	30NX2	...	nan	2005.0	2552	IHI	Weight - Metric Tons	2.0	3.0	2012	1	Saturday

Train on the top 6 dozers assets by count

For timeliness of training filter the data on the Top 6 assets by count

models_to_train = df.ModelID.value_counts().index[:10].to_list()

models_to_train

Out:

['4605',
'3538',
'4604',
'3170',
'3362',
'3537',
'4603',
'3171',
'3357',
'3178']

df[df.ModelID.isin(models_to_train)]

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	fiModelDesc	...	SteeringControls	MfgYear	fiManufacturerID	fiManufacturerDesc	PrimarySizeBasis	PrimaryLower	PrimaryUpper	saleyear	salemonth	saledayofweek
5	1139255	26500.0	1001274	4605	121	3.0	2004	508.0	Low	310G	...	nan	2004.0	43	John Deere	Standard Digging Depth - Ft	14.0	15.0	2008	12	Thursday
10	1139278	24000.0	1024998	4605	121	3.0	2004	1414.0	Medium	310G	...	nan	2004.0	43	John Deere	Standard Digging Depth - Ft	14.0	15.0	2008	8	Thursday
15	1139291	19000.0	1004810	4604	121	3.0	1999	2450.0	Medium	310E	...	nan	1999.0	43	John Deere	Standard Digging Depth - Ft	14.0	15.0	2006	11	Thursday
62	1139469	23000.0	1058869	3171	121	3.0	1998	9987.0	High	580L	...	nan	1998.0	25	Case	Standard Digging Depth - Ft	14.0	15.0	2007	5	Thursday
82	1139515	33000.0	1015565	4605	121	3.0	2002	1268.0	Medium	310G	...	nan	2002.0	43	John Deere	Standard Digging Depth - Ft	14.0	15.0	2004	7	Thursday
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
410243	6288239	18200.0	1835461	4604	149	99.0	2000	48.0	Low	310E	...	nan	2000.0	43	John Deere	Standard Digging Depth - Ft	14.0	15.0	2012	2	Wednesday
410244	6288240	25250.0	1903914	4605	149	0.0	2005	1988.0	Low	310G	...	nan	2005.0	43	John Deere	Standard Digging Depth - Ft	14.0	15.0	2012	1	Saturday
410245	6288241	25250.0	1860549	4605	149	99.0	2006	nan	nan	310G	...	nan	2006.0	43	John Deere	Standard Digging Depth - Ft	14.0	15.0	2012	4	Wednesday
410246	6288243	25000.0	1846184	4605	149	1.0	2006	nan	nan	310G	...	nan	2006.0	43	John Deere	Standard Digging Depth - Ft	14.0	15.0	2012	3	Thursday
410264	6288346	20500.0	1867087	4604	149	4.0	2000	nan	nan	310E	...	nan	2000.0	43	John Deere	Standard Digging Depth - Ft	14.0	15.0	2012	2	Monday

data = df[df.ModelID.isin(models_to_train)]
m = data.isna().sum()
data = data[[col for col in data.columns if col not in data.columns[m == len(data)]]]

#Drop cols cardinality of 1
s = data.nunique()
car_cols = data.columns[(s == 1)]
data = data.drop(columns=car_cols)

#Update numeric columns to be float64
n_cols = data.select_dtypes(include=np.number).columns.tolist()
data[n_cols] = data[n_cols].astype('float64')

data.head()

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	fiModelDesc	...	TireSize	MfgYear	fiManufacturerID	fiManufacturerDesc	PrimarySizeBasis	PrimaryLower	PrimaryUpper	saleyear	salemonth	saledayofweek
5	1.13926e+06	26500	1.00127e+06	4605	121	3	2004	508	Low	310G	...	nan	2004	43	John Deere	Standard Digging Depth - Ft	14	15	2008	12	Thursday
10	1.13928e+06	24000	1.025e+06	4605	121	3	2004	1414	Medium	310G	...	nan	2004	43	John Deere	Standard Digging Depth - Ft	14	15	2008	8	Thursday
15	1.13929e+06	19000	1.00481e+06	4604	121	3	1999	2450	Medium	310E	...	nan	1999	43	John Deere	Standard Digging Depth - Ft	14	15	2006	11	Thursday
62	1.13947e+06	23000	1.05887e+06	3171	121	3	1998	9987	High	580L	...	nan	1998	25	Case	Standard Digging Depth - Ft	14	15	2007	5	Thursday
82	1.13952e+06	33000	1.01556e+06	4605	121	3	2002	1268	Medium	310G	...	nan	2002	43	John Deere	Standard Digging Depth - Ft	14	15	2004	7	Thursday

Addressing Multicollinearity in Model Interpretability

It's well-understood in data science that multicollinearity can significantly hamper the interpretability of models, particularly those based on linear assumptions. The code snippet above demonstrates a rudimentary approach to mitigating multicollinearity by removing highly correlated features. However, it's important to acknowledge that this is a simplified illustration; in practice, the interplay between features can be more subtle and complex.

For robust feature selection and to enhance model explainability, we employ automated feature selection techniques that are thoroughly documented in our project's documentation. These methods go beyond pairwise correlations, considering the multidimensional structure of the data to retain the most informative features. While the current example is not exhaustive, it serves to highlight a fundamental step in preprocessing for linear models. Practitioners are encouraged to leverage our automatic feature selection capabilities to refine their models further and to ensure that the explanatory variables employed are truly reflective of independent factors influencing the response variable.

data["AgeAtSale"] = df["saleyear"] - df["MfgYear"]

drop_cols = [
    # "saleyear", #--> Data encoded in Age at Sale
    "MfgYear", #--> Data encoded in Age at Sale
    "YearMade", #--> Multicollinearity with MfgYear
            ]

target = 'SalePrice'
id_columns=["SalesID",'MachineID','auctioneerID','datasource']

Split the train and validation set

data_train = data[data.saleyear!=2012]
data_val = data[data.saleyear==2012]
data_train= data_train.drop(columns=drop_cols)
data_val=data_val.drop(columns=drop_cols)

#Create the training and validation set
X_train, y_train = data_train.drop('SalePrice', axis=1), data_train['SalePrice']
X_valid, y_valid = data_val.drop('SalePrice', axis=1), data_val['SalePrice']

model = XRegressor(ignore_nan=False)

Initial fit of the Regressor

model.fit(X_train, y_train, id_columns=id_columns)

Out:

<xplainable.core.ml.regression.XRegressor at 0x285e86890>

model.evaluate(X_train, y_train)

Out:

{'Explained Variance': 0.8476,

'MAE': 4292.3689,

'MAPE': 0.1616,

'MSE': 39079631.4865,

'RMSE': 6251.3704,

'RMSLE': nan,

'R2 Score': 0.8476}

model.explain()

Optimising the Model

network = XEvolutionaryNetwork(model)

# Add the layers
# Start with an initial Tighten layer
network.add_layer(
    Tighten(
        iterations=100,
        learning_rate=0.1,
        early_stopping=20
        )
    )

# Add an Evolve layer with a high severity
network.add_layer(
    Evolve(
        mutations=100,
        generations=50,
        max_severity=0.5,
        max_leaves=20,
        early_stopping=20
        )
    )

# Add another Evolve layer with a lower severity and reach
network.add_layer(
    Evolve(
        mutations=100,
        generations=50,
        max_severity=0.3,
        max_leaves=15,
        early_stopping=20
        )
    )

# Add a final Tighten layer with a low learning rate
network.add_layer(
    Tighten(
        iterations=100,
        learning_rate=0.025,
        early_stopping=20
        )
    )

# Fit the network (before or after adding layers)
network.fit(X_train.drop(columns=id_columns), y_train)

# Run the network
network.optimise()

Out:

  0%|          | 0/100 [00:00<?, ?it/s]
0%|          | 0/50 [00:00<?, ?it/s]
0%|          | 0/50 [00:00<?, ?it/s]
0%|          | 0/100 [00:00<?, ?it/s]
<xplainable.core.optimisation.genetic.XEvolutionaryNetwork at 0x2cde7c310>

model.evaluate(X_train, y_train)

Out:

{'Explained Variance': 0.8443,

'MAE': 4129.7996,

'MAPE': 0.148,

'MSE': 40161519.9342,

'RMSE': 6337.3117,

'RMSLE': nan,

'R2 Score': 0.8434}

Simply by fitting a combination of 6 Tighten and Evolution layers we have decreased the MAE by approximately 90. Play around with more layers to see if it's possible to obtain better results.

plot_error(model, X_train, y_train)

Comparing against the validation set

model.evaluate(X_valid, y_valid)

Out:

{'Explained Variance': 0.8453,

'MAE': 4859.9945,

'MAPE': 0.1788,

'MSE': 46180914.83,

'RMSE': 6795.6541,

'RMSLE': nan,

'R2 Score': 0.8057}

plot_error(model, X_valid, y_valid)

model.explain()

Explaining the variance in the Error Plot

Prior to examining the detailed error plot, it is essential to consider the real-world operational differences among various bulldozer models, as well as the insights provided by subject matter experts (SMEs). These differences are likely to manifest as distinct groupings in the predicted versus actual results. Each model type's unique characteristics—such as age, usage and maintenance history factors that could create these groups, affecting the sale prices and thus the prediction accuracy. Recognizing these potential variances will prepare us to understand and address the disparities in the predictive performance across different Model IDs that the following plot will reveal.

plot_error(model, X_train, y_train, alpha=0.4, color_column="ModelID")

Insights from Scatter Plot Analysis

The scatter plot displayed above demonstrates a significant variation in the predictive accuracy across different Model IDs, as indicated by the spread of points in relation to the black dashed line, which represents perfect prediction. Models such as those in the yellow cluster are closely aligned with the line, suggesting higher prediction accuracy for these Model IDs. This observation underscores the importance of partitioning the dataset to develop model-specific predictive algorithms. By doing so, we can account for the unique characteristics of each model, which may include factors specific to the model that affect the score contributions.

Persisting to Xplainable Cloud

Step 1: Instantiate the Client

Connect to the Xplainable API using your provided API key and the local hostname. This allows further interaction with the platform for model creation and deployment.

#Instantiating the client
client = xplainable_client.Client(
    api_key="f2be58ea-8253-4ea2-b1d8-0a9cb43bf91f",#<- Add your own token here
    hostname="http://localhost:8000"
)

Step 2: Create a Model

Define and create a machine learning model on the Xplainable platform. This includes setting a name, description, and providing training features (X_train) and targets (y_train).

# Create a model
model_id = client.create_model(
    model=model,
    model_name="Dozer Price Prediction",
    model_description="Predicting the price of a different dozer types",
    x=X_train,
    y=y_train
)

Out:

  0%|          | 0/41 [00:00<?, ?it/s]
<Response [200]>

Step 3: Deploy the Model

Deploy the model to make it available for inference. You’ll use the version ID returned from the model creation step to deploy this specific version.

deployment = client.deploy(
    model_version_id=model_id["version_id"] #<- Use version id produced above
)

Out:

<Response [200]>

Step 4: Activate the Deployment

Activate the model deployment so that it’s ready to receive inference requests.

client.activate_deployment(deployment['deployment_id'])

Out:

{'message': 'activated deployment'}

Step 5: Generate a Deploy Key

Generate an API deploy key for secure access to the deployed model. This key will be used to authenticate when making prediction requests.

deploy_key = client.generate_deploy_key(deployment['deployment_id'],'API key for Dozer Price Prediction', 7)

Out:

<Response [200]>

Step 6: Format a Sample Input

Prepare a single test sample (excluding the target column SalePrice) to be used for model inference. This sample is converted to JSON format for use in an API call.

body = json.loads(
    df[df.ModelID == "4605"].drop(columns=["SalePrice"]).sample(1).to_json(orient="records")
    )

body

Out:

[{'SalesID': 1638768,

'MachineID': 1520710,

'ModelID': '4605',

'datasource': 132,

'auctioneerID': 1.0,

'YearMade': 2001,

'MachineHoursCurrentMeter': None,

'UsageBand': None,

'fiModelDesc': '310G',

'fiBaseModel': '310',

'fiSecondaryDesc': 'G',

'fiModelSeries': None,

'fiModelDescriptor': None,

'ProductSize': None,

'fiProductClassDesc': 'Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth',

'state': 'Maryland',

'ProductGroup': 'BL',

'ProductGroupDesc': 'Backhoe Loaders',

'DriveSystem': 'Four Wheel Drive',

'Enclosure': 'EROPS',

'Forks': 'None or Unspecified',

'PadType': 'None or Unspecified',

'RideControl': 'No',

'Stick': 'Extended',

'Transmission': 'Standard',

'Turbocharged': 'None or Unspecified',

'BladeExtension': None,

'BladeWidth': None,

'EnclosureType': None,

'EngineHorsepower': None,

'Hydraulics': None,

'Pushblock': None,

'Ripper': None,

'Scarifier': None,

'TipControl': None,

'TireSize': None,

'Coupler': None,

'CouplerSystem': None,

'GrouserTracks': None,

'HydraulicsFlow': None,

'TrackType': None,

'UndercarriagePadWidth': None,

'StickLength': None,

'Thumb': None,

'PatternChanger': None,

'GrouserType': None,

'BackhoeMounting': None,

'BladeType': None,

'TravelControls': None,

'DifferentialType': None,

'SteeringControls': None,

'MfgYear': 2001.0,

'fiManufacturerID': 43,

'fiManufacturerDesc': 'John Deere',

'PrimarySizeBasis': 'Standard Digging Depth - Ft',

'PrimaryLower': 14.0,

'PrimaryUpper': 15.0,

'saleyear': 2004,

'salemonth': 10,

'saledayofweek': 'Wednesday'}]

Step 7: Send a Prediction Request

Make a POST request to the Xplainable inference endpoint with the sample input. The deploy_key is included in the headers for authentication, and the model returns a prediction based on the JSON-formatted input data.

response = requests.post(
    url="https://inference.xplainable.io/v1/predict",
    headers={'api_key': deploy_key['deploy_key']},
    json=body
)

value = response.json()
value

Out:

[{'index': 0,
'id': '1638768-1520710-1-132',
'partition': '__dataset__',
'pred': 23689.41691069905,
'breakdown': [{'feature': 'base_value',
  'value': None,
  'score': 28706.30780882021},
 {'feature': 'ModelID', 'value': '4605', 'score': 0.0},
 {'feature': 'MachineHoursCurrentMeter',
  'value': 'nan',
  'score': 4542.336047069585},
 {'feature': 'UsageBand', 'value': 'nan', 'score': -2.376759178725753e-13},
 {'feature': 'fiModelDesc', 'value': '310G', 'score': -0.0},
 {'feature': 'fiBaseModel', 'value': '310', 'score': 0.0},
 {'feature': 'fiSecondaryDesc', 'value': 'G', 'score': 0.0},
 {'feature': 'fiProductClassDesc',
  'value': 'Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth',
  'score': -0.0},
 {'feature': 'state', 'value': 'Maryland', 'score': -0.0},
 {'feature': 'ProductGroup', 'value': 'BL', 'score': -0.0},
 {'feature': 'ProductGroupDesc', 'value': 'Backhoe Loaders', 'score': -0.0},
 {'feature': 'DriveSystem', 'value': 'Four Wheel Drive', 'score': -0.0},
 {'feature': 'Enclosure', 'value': 'EROPS', 'score': 0.0},
 {'feature': 'Forks', 'value': 'None or Unspecified', 'score': -0.0},
 {'feature': 'PadType', 'value': 'None or Unspecified', 'score': -0.0},
 {'feature': 'RideControl', 'value': 'No', 'score': -0.0},
 {'feature': 'Stick', 'value': 'Extended', 'score': -0.0},
 {'feature': 'Transmission', 'value': 'Standard', 'score': -0.0},
 {'feature': 'Turbocharged', 'value': 'None or Unspecified', 'score': -0.0},
 {'feature': 'BladeExtension',
  'value': 'nan',
  'score': 1.7494329609441378e-10},
 {'feature': 'BladeWidth', 'value': 'nan', 'score': 2.7177059816568793e-10},
 {'feature': 'EnclosureType',
  'value': 'nan',
  'score': 5.871688102779908e-10},
 {'feature': 'EngineHorsepower',
  'value': 'nan',
  'score': 1.8710924183226608e-10},
 {'feature': 'Hydraulics', 'value': 'nan', 'score': 8.581227981031664e-10},
 {'feature': 'Pushblock', 'value': 'nan', 'score': 3.9066875559563424e-10},
 {'feature': 'Ripper', 'value': 'nan', 'score': 3.375317912606129e-10},
 {'feature': 'Scarifier', 'value': 'nan', 'score': 8.744863232788086e-10},
 {'feature': 'TipControl', 'value': 'nan', 'score': 2.3027298498497101e-10},
 {'feature': 'TireSize', 'value': 'nan', 'score': 1.5156128429182077e-10},
 {'feature': 'fiManufacturerID', 'value': '43', 'score': -2372.963248916497},
 {'feature': 'fiManufacturerDesc', 'value': 'John Deere', 'score': -0.0},
 {'feature': 'PrimarySizeBasis',
  'value': 'Standard Digging Depth - Ft',
  'score': -0.0},
 {'feature': 'PrimaryLower', 'value': '14', 'score': -387.094451758901},
 {'feature': 'PrimaryUpper', 'value': '15', 'score': -363.50468840760936},
 {'feature': 'saleyear', 'value': '2004', 'score': -6336.19859837548},
 {'feature': 'salemonth', 'value': '10', 'score': -99.4659577363224},
 {'feature': 'saledayofweek', 'value': 'Wednesday', 'score': -0.0}]}]

Partitioned Models

The Power of Partitioned Models in Price Prediction

When predicting prices for heavy equipment like in the Bluebook Dozer Price Prediction challenge, one-size-fits-all models often fall short. Different equipment models (ModelIDs) can have vastly different characteristics—age, usage patterns, depreciation curves, and market dynamics. Trying to capture all of that in a single global model can dilute performance.

What is a Partitioned Model?

A partitioned model means training separate models for each subgroup or partition in the data—in this case, for each unique ModelID. Instead of fitting one global model to the entire dataset, you're allowing the model to specialize based on contextual differences.

In Xplainable, this can be achieved seamlessly by training per-group models through the auto-training UI or the client.

from xplainable.core.models import PartitionedRegressor, XRegressor

# Train your model (this will open an embedded gui)
partitioned_model = PartitionedRegressor(partition_on='ModelID')

# Iterate over the unique values in the partition column
for partition in df.ModelID.value_counts().index[:10].to_list():
    # Get the data for the partition
    part = data[data['ModelID'] == partition].drop(columns=drop_cols)
    
    # IMPORTANT: Exclude the partition column (ModelID) from features
    partition_features = [col for col in part.columns if col not in ['SalePrice', 'ModelID']]
    x_train_partition = part[partition_features]
    y_train_partition = part['SalePrice']
    
    # Fit the embedded model
    model_partition = XRegressor()
    model_partition.fit(x_train_partition, y_train_partition, id_columns=id_columns)
    
    # Add the model to the partitioned model
    partitioned_model.add_partition(model_partition, partition)

# IMPORTANT: Add the __dataset__ partition (full dataset model)
# Also exclude ModelID from the full dataset features
full_dataset_features = [col for col in X_train.columns if col != 'ModelID']
X_train_no_modelid = X_train[full_dataset_features]

full_model = XRegressor()
full_model.fit(X_train_no_modelid, y_train, id_columns=id_columns)
partitioned_model.add_partition(full_model, '__dataset__')

# Now you can predict on the partitioned model
# The prediction will use ModelID for routing, but not as a feature
y_pred = partitioned_model.predict(X_valid)

plot_error(partitioned_model, X_train, y_train, color_column="ModelID")

plot_error(partitioned_model, X_valid, y_valid, color_column="ModelID")

Evaluation of Model Predictions Against Validation Data

The scatter plot illustrates our model's performance on the validation set, comparing the true values against the predicted values for various bulldozer models. While the trend line shows that our model predictions are generally aligned with the true values, there is an observable underprediction across the data points, as evidenced by the mean absolute error (MAE) of 3599 vs 3212 on the train.

The impact of the mining boom in Australia in 2012, referenced from the Reserve Bank of Australia's report, suggests an economic context that may influence equipment prices. Incorporating macroeconomic indicators could potentially enhance the model's predictive accuracy.
Introducing time series features that capture year-over-year changes could offer a more nuanced understanding of price fluctuations over time, rather than relying solely on 'Age at Sale', which may not fully encapsulate such trends.

These considerations point towards the inclusion of external economic factors and more sophisticated time-based features to improve the model's prediction capabilities. Further analysis and iterative model tuning will be required to reduce the prediction error and align the model outputs more closely with the validation data.

Further Investigation:

An analysis of the trend line derived from time series splits (Age at Sale) could reveal insights into future forecasting capabilities. By extending this trend line, we can project forward forecasts that anticipate equipment prices. This approach could be particularly beneficial for capturing the trajectory of market shifts influenced by macroeconomic trends, such as the mining boom.

Should anyone be interested in contributing to the development of this predictive feature or investigating this further, please feel free to add to the issues on our repository or contact us directly at [email protected].

Access model partitions and plot explanations

partitioned_model.partitions

Out:

{'4605': <xplainable.core.ml.regression.XRegressor at 0x2b9265a50>,
'3538': <xplainable.core.ml.regression.XRegressor at 0x2acfb4220>,
'4604': <xplainable.core.ml.regression.XRegressor at 0x2acfb5ed0>,
'3170': <xplainable.core.ml.regression.XRegressor at 0x2acfce980>,
'3362': <xplainable.core.ml.regression.XRegressor at 0x2b89a76d0>,
'3537': <xplainable.core.ml.regression.XRegressor at 0x2b8940160>,
'4603': <xplainable.core.ml.regression.XRegressor at 0x2b89b7df0>,
'3171': <xplainable.core.ml.regression.XRegressor at 0x2b8955ff0>,
'3357': <xplainable.core.ml.regression.XRegressor at 0x2b8943040>,
'3178': <xplainable.core.ml.regression.XRegressor at 0x2b9264490>,
'__dataset__': <xplainable.core.ml.regression.XRegressor at 0x2ad193850>}

partitioned_model.partitions['3170'].explain()

# Create a model
model_id = client.create_model(
    model=partitioned_model,
    model_name="Dozer Partitioned Model",
    model_description="Predicting the price of a different dozer types partitioned on ModelID",
    x=X_train,
    y=y_train
)

Out:

0%| | 0/41 [00:00<?, ?it/s]

Skipping column BladeExtension because it is completely NaN.

Skipping column BladeWidth because it is completely NaN.

Skipping column EnclosureType because it is completely NaN.

Skipping column EngineHorsepower because it is completely NaN.

Skipping column Hydraulics because it is completely NaN.

Skipping column Pushblock because it is completely NaN.

Skipping column Ripper because it is completely NaN.

Skipping column Scarifier because it is completely NaN.

Skipping column TipControl because it is completely NaN.

Skipping column TireSize because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Skipping column BladeExtension because it is completely NaN.

Skipping column BladeWidth because it is completely NaN.

Skipping column EnclosureType because it is completely NaN.

Skipping column EngineHorsepower because it is completely NaN.

Skipping column Hydraulics because it is completely NaN.

Skipping column Pushblock because it is completely NaN.

Skipping column Ripper because it is completely NaN.

Skipping column Scarifier because it is completely NaN.

Skipping column TipControl because it is completely NaN.

Skipping column TireSize because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Skipping column BladeExtension because it is completely NaN.

Skipping column BladeWidth because it is completely NaN.

Skipping column EnclosureType because it is completely NaN.

Skipping column EngineHorsepower because it is completely NaN.

Skipping column Hydraulics because it is completely NaN.

Skipping column Pushblock because it is completely NaN.

Skipping column Ripper because it is completely NaN.

Skipping column Scarifier because it is completely NaN.

Skipping column TipControl because it is completely NaN.

Skipping column TireSize because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Skipping column BladeExtension because it is completely NaN.

Skipping column BladeWidth because it is completely NaN.

Skipping column EnclosureType because it is completely NaN.

Skipping column EngineHorsepower because it is completely NaN.

Skipping column Hydraulics because it is completely NaN.

Skipping column Pushblock because it is completely NaN.

Skipping column Ripper because it is completely NaN.

Skipping column Scarifier because it is completely NaN.

Skipping column TipControl because it is completely NaN.

Skipping column TireSize because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Skipping column Forks because it is completely NaN.

Skipping column PadType because it is completely NaN.

Skipping column RideControl because it is completely NaN.

Skipping column Stick because it is completely NaN.

Skipping column Turbocharged because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Skipping column BladeExtension because it is completely NaN.

Skipping column BladeWidth because it is completely NaN.

Skipping column EnclosureType because it is completely NaN.

Skipping column EngineHorsepower because it is completely NaN.

Skipping column Hydraulics because it is completely NaN.

Skipping column Pushblock because it is completely NaN.

Skipping column Ripper because it is completely NaN.

Skipping column Scarifier because it is completely NaN.

Skipping column TipControl because it is completely NaN.

Skipping column TireSize because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Skipping column BladeExtension because it is completely NaN.

Skipping column BladeWidth because it is completely NaN.

Skipping column EnclosureType because it is completely NaN.

Skipping column EngineHorsepower because it is completely NaN.

Skipping column Hydraulics because it is completely NaN.

Skipping column Pushblock because it is completely NaN.

Skipping column Ripper because it is completely NaN.

Skipping column Scarifier because it is completely NaN.

Skipping column TipControl because it is completely NaN.

Skipping column TireSize because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Skipping column BladeExtension because it is completely NaN.

Skipping column BladeWidth because it is completely NaN.

Skipping column EnclosureType because it is completely NaN.

Skipping column EngineHorsepower because it is completely NaN.

Skipping column Hydraulics because it is completely NaN.

Skipping column Pushblock because it is completely NaN.

Skipping column Ripper because it is completely NaN.

Skipping column Scarifier because it is completely NaN.

Skipping column TipControl because it is completely NaN.

Skipping column TireSize because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Skipping column Forks because it is completely NaN.

Skipping column PadType because it is completely NaN.

Skipping column RideControl because it is completely NaN.

Skipping column Stick because it is completely NaN.

Skipping column Turbocharged because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Skipping column BladeExtension because it is completely NaN.

Skipping column BladeWidth because it is completely NaN.

Skipping column EnclosureType because it is completely NaN.

Skipping column EngineHorsepower because it is completely NaN.

Skipping column Hydraulics because it is completely NaN.

Skipping column Pushblock because it is completely NaN.

Skipping column Ripper because it is completely NaN.

Skipping column Scarifier because it is completely NaN.

Skipping column TipControl because it is completely NaN.

Skipping column TireSize because it is completely NaN.

0%| | 0/41 [00:00<?, ?it/s]

Step 3: Deploy the Model

Deploy the model to make it available for inference. You’ll use the version ID returned from the model creation step to deploy this specific version.

deployment = client.deploy(
    model_version_id=model_id["version_id"] #<- Use version id produced above
)

Out:

<Response [200]>

Step 4: Activate the Deployment

Activate the model deployment so that it’s ready to receive inference requests.

client.activate_deployment(deployment['deployment_id'])

Out:

{'message': 'activated deployment'}

Step 5: Generate a Deploy Key

Generate an API deploy key for secure access to the deployed model. This key will be used to authenticate when making prediction requests.

deploy_key = client.generate_deploy_key(deployment['deployment_id'],'API key for Dozer Price Prediction', 7)

Out:

<Response [200]>

Step 6: Format a Sample Input

Prepare a single test sample (excluding the target column SalePrice) to be used for model inference. This sample is converted to JSON format for use in an API call.

body = json.loads(
    df[df.ModelID == "4605"].drop(columns=["SalePrice"]).sample(1).to_json(orient="records")
    )

body

Out:

[{'SalesID': 2644612,

'MachineID': 1877801,

'ModelID': '4605',

'datasource': 149,

'auctioneerID': 1.0,

'YearMade': 2001,

'MachineHoursCurrentMeter': 3834.0,

'UsageBand': 'Medium',

'fiModelDesc': '310G',

'fiBaseModel': '310',

'fiSecondaryDesc': 'G',

'fiModelSeries': None,

'fiModelDescriptor': None,

'ProductSize': None,

'fiProductClassDesc': 'Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth',

'state': 'Nevada',

'ProductGroup': 'BL',

'ProductGroupDesc': 'Backhoe Loaders',

'DriveSystem': 'Two Wheel Drive',

'Enclosure': 'EROPS w AC',

'Forks': 'None or Unspecified',

'PadType': 'None or Unspecified',

'RideControl': 'No',

'Stick': 'Standard',

'Transmission': 'Standard',

'Turbocharged': 'None or Unspecified',

'BladeExtension': None,

'BladeWidth': None,

'EnclosureType': None,

'EngineHorsepower': None,

'Hydraulics': None,

'Pushblock': None,

'Ripper': None,

'Scarifier': None,

'TipControl': None,

'TireSize': None,

'Coupler': None,

'CouplerSystem': None,

'GrouserTracks': None,

'HydraulicsFlow': None,

'TrackType': None,

'UndercarriagePadWidth': None,

'StickLength': None,

'Thumb': None,

'PatternChanger': None,

'GrouserType': None,

'BackhoeMounting': None,

'BladeType': None,

'TravelControls': None,

'DifferentialType': None,

'SteeringControls': None,

'MfgYear': 2001.0,

'fiManufacturerID': 43,

'fiManufacturerDesc': 'John Deere',

'PrimarySizeBasis': 'Standard Digging Depth - Ft',

'PrimaryLower': 14.0,

'PrimaryUpper': 15.0,

'saleyear': 2011,

'salemonth': 6,

'saledayofweek': 'Friday'}]

Step 7: Send a Prediction Request

response = requests.post(
    url="https://inference.xplainable.io/v1/predict",
    headers={'api_key': deploy_key['deploy_key']},
    json=body
)

value = response.json()
value

Out:

[{'index': 0,
'id': '2644612-1877801-1-149',
'partition': '4605',
'pred': 27167.26121041039,
'breakdown': [{'feature': 'base_value',
  'value': None,
  'score': 25718.595440841083},
 {'feature': 'MachineHoursCurrentMeter',
  'value': '3834',
  'score': 1451.6132550506145},
 {'feature': 'UsageBand', 'value': 'Medium', 'score': 0.0},
 {'feature': 'fiModelDesc', 'value': '310G', 'score': 0.0},
 {'feature': 'fiBaseModel', 'value': '310', 'score': 0.0},
 {'feature': 'fiSecondaryDesc', 'value': 'G', 'score': 0.0},
 {'feature': 'fiProductClassDesc',
  'value': 'Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth',
  'score': 0.0},
 {'feature': 'state', 'value': 'Nevada', 'score': -0.0},
 {'feature': 'ProductGroup', 'value': 'BL', 'score': 0.0},
 {'feature': 'ProductGroupDesc', 'value': 'Backhoe Loaders', 'score': 0.0},
 {'feature': 'DriveSystem', 'value': 'Two Wheel Drive', 'score': -0.0},
 {'feature': 'Enclosure', 'value': 'EROPS w AC', 'score': 0.0},
 {'feature': 'Forks', 'value': 'None or Unspecified', 'score': -0.0},
 {'feature': 'PadType', 'value': 'None or Unspecified', 'score': -0.0},
 {'feature': 'RideControl', 'value': 'No', 'score': -0.0},
 {'feature': 'Stick', 'value': 'Standard', 'score': -0.0},
 {'feature': 'Transmission', 'value': 'Standard', 'score': 0.0},
 {'feature': 'Turbocharged', 'value': 'None or Unspecified', 'score': -0.0},
 {'feature': 'BladeExtension', 'value': 'nan', 'score': 0.0},
 {'feature': 'BladeWidth', 'value': 'nan', 'score': 0.0},
 {'feature': 'EnclosureType', 'value': 'nan', 'score': 0.0},
 {'feature': 'EngineHorsepower', 'value': 'nan', 'score': 0.0},
 {'feature': 'Hydraulics', 'value': 'nan', 'score': 0.0},
 {'feature': 'Pushblock', 'value': 'nan', 'score': 0.0},
 {'feature': 'Ripper', 'value': 'nan', 'score': 0.0},
 {'feature': 'Scarifier', 'value': 'nan', 'score': 0.0},
 {'feature': 'TipControl', 'value': 'nan', 'score': 0.0},
 {'feature': 'TireSize', 'value': 'nan', 'score': 0.0},
 {'feature': 'fiManufacturerID', 'value': '43', 'score': 0.0},
 {'feature': 'fiManufacturerDesc', 'value': 'John Deere', 'score': 0.0},
 {'feature': 'PrimarySizeBasis',
  'value': 'Standard Digging Depth - Ft',
  'score': -0.0},
 {'feature': 'PrimaryLower', 'value': '14', 'score': 0.0},
 {'feature': 'PrimaryUpper', 'value': '15', 'score': 0.0},
 {'feature': 'saleyear', 'value': '2011', 'score': 0.2771819015112775},
 {'feature': 'salemonth', 'value': '6', 'score': -3.2246673828189247},
 {'feature': 'saledayofweek', 'value': 'Friday', 'score': -0.0}]}]

Introduction to Bluebook for Bulldozers Dataset

Overview of the Dataset​

Importance of the Dataset​

Explainable Machine Learning

Understanding Explainability in ML​

Role of Explainability in Predictive Modeling​

Potential Use Case: Analysing Arbitrage Opportunities​

Defining Arbitrage in Equipment Sales​

Utilising ML Predictions for Arbitrage​

Error plot Function Overwrite​

Building the Dataset

Feature Engineering Overview

Train on the top 6 dozers assets by count​

Addressing Multicollinearity in Model Interpretability​

Split the train and validation set​

Initial fit of the Regressor​

Optimising the Model​

Explaining the variance in the Error Plot​

Insights from Scatter Plot Analysis​

Persisting to Xplainable Cloud

Step 1: Instantiate the Client​

Step 2: Create a Model​

Step 3: Deploy the Model​

Step 4: Activate the Deployment​

Step 5: Generate a Deploy Key​

Step 6: Format a Sample Input​

Step 7: Send a Prediction Request​

Partitioned Models

The Power of Partitioned Models in Price Prediction​

What is a Partitioned Model?​

Evaluation of Model Predictions Against Validation Data​

Considerations for Model Refinement:​

Further Investigation:​

Access model partitions and plot explanations​

Step 3: Deploy the Model​

Step 4: Activate the Deployment​

Step 5: Generate a Deploy Key​

Step 6: Format a Sample Input​

Step 7: Send a Prediction Request​

Overview of the Dataset

Importance of the Dataset

Understanding Explainability in ML

Role of Explainability in Predictive Modeling

Potential Use Case: Analysing Arbitrage Opportunities

Defining Arbitrage in Equipment Sales

Utilising ML Predictions for Arbitrage

Error plot Function Overwrite

Train on the top 6 dozers assets by count

Addressing Multicollinearity in Model Interpretability

Split the train and validation set

Initial fit of the Regressor

Optimising the Model

Explaining the variance in the Error Plot

Insights from Scatter Plot Analysis

Step 1: Instantiate the Client

Step 2: Create a Model

Step 3: Deploy the Model

Step 4: Activate the Deployment

Step 5: Generate a Deploy Key

Step 6: Format a Sample Input

Step 7: Send a Prediction Request

The Power of Partitioned Models in Price Prediction

What is a Partitioned Model?

Evaluation of Model Predictions Against Validation Data

Considerations for Model Refinement:

Further Investigation:

Access model partitions and plot explanations

Step 3: Deploy the Model

Step 4: Activate the Deployment

Step 5: Generate a Deploy Key

Step 6: Format a Sample Input

Step 7: Send a Prediction Request